The BRC is: Basic BRC plug.
Every dataset is different. Sometimes very different.
There are many ways to do things. Everyone has their favorite syntax.
The issue:
Many fundamental data processing functions exist in base R and beyond. Sometimes they can be inconsistent or unnecessarily complex. The result is code that is confusing and doesn’t flow, e.g. deeply nested function calls.
Tidyverse is, most importantly, a philosophy for data analysis that more often than not makes wrangling data easier. The tidyverse community has built what they describe as an “opinionated” group of packages. These packages readily talk to one another.
Their manifesto: https://cran.r-project.org/web/packages/tidyverse/vignettes/manifesto.html
Other tools have now been made for the tidy community. This community also overlaps with Bioconductor. But the packages above are the linchpins that hold it together.
Workflow Image for working with data.
## salmon_id common_name age_classbylength length_mm IGF1_ng_ml
## 1 35032 Chinook salmon yearling 147 41.26006
## 2 35035 Sockeye salmon juvenile 121 NA
## 3 35036 Sockeye salmon juvenile 112 NA
## 4 35037 Steelhead juvenile 220 42.70981
## 5 35038 Steelhead juvenile 152 NA
## 6 35033 Chinook salmon mixed age juvenile 444 62.11528
## salmon_id common_name age_classbylength variable value
## 1 35032 Chinook salmon yearling length_mm 147.00000
## 2 35032 Chinook salmon yearling IGF1_ng_ml 41.26006
## 3 35033 Chinook salmon mixed age juvenile length_mm 444.00000
## 4 35033 Chinook salmon mixed age juvenile IGF1_ng_ml 62.11528
## 5 35034 Sockeye salmon juvenile length_mm 139.00000
## 6 35034 Sockeye salmon juvenile IGF1_ng_ml NA
## salmon_id common_name age_classbylength variable value
## 1 35032 Chinook salmon yearling length_mm 147
## 2 35033 Chinook salmon mixed age juvenile length_mm 444
## 3 35034 Sockeye salmon juvenile length_mm 139
## 4 35035 Sockeye salmon juvenile length_mm 121
## 5 35036 Sockeye salmon juvenile length_mm 112
## 6 35037 Steelhead juvenile length_mm 220
## salmon_id common_name age_classbylength variable value
## 1 35032 Chinook salmon yearling IGF1_ng_ml 41.26006
## 2 35033 Chinook salmon mixed age juvenile IGF1_ng_ml 62.11528
## 3 35034 Sockeye salmon juvenile IGF1_ng_ml NA
## 4 35035 Sockeye salmon juvenile IGF1_ng_ml NA
## 5 35036 Sockeye salmon juvenile IGF1_ng_ml NA
## 6 35037 Steelhead juvenile IGF1_ng_ml 42.70981
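A tidy dataset is a data frame (or table) for which the following are true: (1) each variable has its own column, (2) each observation has its own row, and (3) each value has its own cell.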
Our first dataframe is tidy
Consistent dataframe layouts help to ensure that all values are present and that relationships between data points are clear.
R is a vectorized programming language. R builds data frames from vectors, and R works best when its operations are vectorized. Tidy data makes use of both of these aspects of R.
=> Precise and Fast
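As a minimal sketch of what "vectorized" means here, the base R snippet below applies one expression to a whole vector and compares it with an explicit loop. The values and the object names (`length_mm`, `toy`) are made up for illustration and are not taken from the salmon data.

```r
# Hypothetical lengths in millimetres (illustrative values only)
length_mm <- c(147, 121, 112, 220)

# Vectorized: one expression converts every element at once
length_cm <- length_mm / 10

# Equivalent explicit loop, shown only for comparison
length_cm_loop <- numeric(length(length_mm))
for (i in seq_along(length_mm)) {
  length_cm_loop[i] <- length_mm[i] / 10
}

# A data frame is a list of equal-length vectors, so a whole column
# can be transformed with the same single expression
toy <- data.frame(length_mm = length_mm)
toy$length_cm <- toy$length_mm / 10
```

When every column holds one variable, as in tidy data, this kind of whole-column operation is all that most transformations need.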
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.3
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
-> insert graphic here https://github.com/trinker/tidyr_in_a_nutshell
Select allows you to make a data frame from a specific variable or variables
## # A tibble: 97 x 1
## common_name
## <chr>
## 1 Chinook salmon
## 2 Sockeye salmon
## 3 Sockeye salmon
## 4 Steelhead
## 5 Steelhead
## 6 Chinook salmon
## 7 Sockeye salmon
## 8 Steelhead
## 9 Steelhead
## 10 Steelhead
## # … with 87 more rows
# Select two variables (age_classbylength and common_name)
select(df1, age_classbylength, common_name)
## # A tibble: 97 x 2
## age_classbylength common_name
## <chr> <chr>
## 1 yearling Chinook salmon
## 2 juvenile Sockeye salmon
## 3 juvenile Sockeye salmon
## 4 juvenile Steelhead
## 5 juvenile Steelhead
## 6 mixed age juvenile Chinook salmon
## 7 juvenile Sockeye salmon
## 8 juvenile Steelhead
## 9 juvenile Steelhead
## 10 juvenile Steelhead
## # … with 87 more rows
## # A tibble: 97 x 4
## salmon_id common_name age_classbylength IGF1_ng_ml
## <dbl> <chr> <chr> <dbl>
## 1 35032 Chinook salmon yearling 41.3
## 2 35035 Sockeye salmon juvenile NA
## 3 35036 Sockeye salmon juvenile NA
## 4 35037 Steelhead juvenile 42.7
## 5 35038 Steelhead juvenile NA
## 6 35033 Chinook salmon mixed age juvenile 62.1
## 7 35034 Sockeye salmon juvenile NA
## 8 35048 Steelhead juvenile 24.2
## 9 35049 Steelhead juvenile NA
## 10 35050 Steelhead juvenile 63.5
## # … with 87 more rows
# Select a range of contiguous variables (common_name:length_mm)
select(df1, common_name:length_mm)
## # A tibble: 97 x 3
## common_name age_classbylength length_mm
## <chr> <chr> <dbl>
## 1 Chinook salmon yearling 147
## 2 Sockeye salmon juvenile 121
## 3 Sockeye salmon juvenile 112
## 4 Steelhead juvenile 220
## 5 Steelhead juvenile 152
## 6 Chinook salmon mixed age juvenile 444
## 7 Sockeye salmon juvenile 139
## 8 Steelhead juvenile 288
## 9 Steelhead juvenile 190
## 10 Steelhead juvenile 283
## # … with 87 more rows
Filter allows you to access observations based on specific criteria
# Filter all observations where the variable common_name is Sockeye salmon
filter(df1, common_name == 'Sockeye salmon')
## # A tibble: 11 x 5
## salmon_id common_name age_classbylength length_mm IGF1_ng_ml
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 35035 Sockeye salmon juvenile 121 NA
## 2 35036 Sockeye salmon juvenile 112 NA
## 3 35034 Sockeye salmon juvenile 139 NA
## 4 35144 Sockeye salmon juvenile 140 NA
## 5 35147 Sockeye salmon juvenile 115 NA
## 6 35096 Sockeye salmon juvenile 115 NA
## 7 35097 Sockeye salmon juvenile 110 NA
## 8 35098 Sockeye salmon juvenile 112 NA
## 9 35099 Sockeye salmon juvenile 111 NA
## 10 35100 Sockeye salmon juvenile 118 NA
## 11 35119 Sockeye salmon juvenile 122 NA
# Filter all observations where the variable common_name is either Sockeye salmon or Chinook salmon
filter(df1, common_name %in% c('Sockeye salmon', 'Chinook salmon'))
## # A tibble: 57 x 5
## salmon_id common_name age_classbylength length_mm IGF1_ng_ml
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 35032 Chinook salmon yearling 147 41.3
## 2 35035 Sockeye salmon juvenile 121 NA
## 3 35036 Sockeye salmon juvenile 112 NA
## 4 35033 Chinook salmon mixed age juvenile 444 62.1
## 5 35034 Sockeye salmon juvenile 139 NA
## 6 35142 Chinook salmon yearling 149 66.5
## 7 35143 Chinook salmon yearling 204 80.9
## 8 35144 Sockeye salmon juvenile 140 NA
## 9 35145 Chinook salmon yearling 130 23.4
## 10 35146 Chinook salmon mixed age juvenile 422 101.
## # … with 47 more rows
# Filter all observations where the variable common_name ends with 'salmon'. To do this we use the stringr function str_ends() to recognise strings that end with 'salmon'.
filter(df1, str_ends(common_name, 'salmon'))
## # A tibble: 59 x 5
## salmon_id common_name age_classbylength length_mm IGF1_ng_ml
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 35032 Chinook salmon yearling 147 41.3
## 2 35035 Sockeye salmon juvenile 121 NA
## 3 35036 Sockeye salmon juvenile 112 NA
## 4 35033 Chinook salmon mixed age juvenile 444 62.1
## 5 35034 Sockeye salmon juvenile 139 NA
## 6 35142 Chinook salmon yearling 149 66.5
## 7 35143 Chinook salmon yearling 204 80.9
## 8 35144 Sockeye salmon juvenile 140 NA
## 9 35145 Chinook salmon yearling 130 23.4
## 10 35146 Chinook salmon mixed age juvenile 422 101.
## # … with 49 more rows
# Filter all observations where the variable length_mm is greater than 200 or less than 120
filter(df1, length_mm > 200 | length_mm < 120)
## # A tibble: 36 x 5
## salmon_id common_name age_classbylength length_mm IGF1_ng_ml
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 35036 Sockeye salmon juvenile 112 NA
## 2 35037 Steelhead juvenile 220 42.7
## 3 35033 Chinook salmon mixed age juvenile 444 62.1
## 4 35048 Steelhead juvenile 288 24.2
## 5 35050 Steelhead juvenile 283 63.5
## 6 35051 Steelhead juvenile 279 61.2
## 7 35052 Steelhead juvenile 235 30.6
## 8 35053 Steelhead juvenile 230 49.4
## 9 35056 Steelhead juvenile 208 57.4
## 10 35057 Steelhead juvenile 240 20.2
## # … with 26 more rows
Arrange sorts the dataframe based on a specific variable or variables
## # A tibble: 97 x 5
## salmon_id common_name age_classbylength length_mm IGF1_ng_ml
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 35095 Chinook salmon subyearling 90 NA
## 2 35097 Sockeye salmon juvenile 110 NA
## 3 35099 Sockeye salmon juvenile 111 NA
## 4 35036 Sockeye salmon juvenile 112 NA
## 5 35098 Sockeye salmon juvenile 112 NA
## 6 35147 Sockeye salmon juvenile 115 NA
## 7 35096 Sockeye salmon juvenile 115 NA
## 8 35100 Sockeye salmon juvenile 118 NA
## 9 35035 Sockeye salmon juvenile 121 NA
## 10 35119 Sockeye salmon juvenile 122 NA
## # … with 87 more rows
# Arrange the data first based on the variable common_name, then secondly based on length_mm in a descending order.
arrange(df1, common_name, desc(length_mm))
## # A tibble: 97 x 5
## salmon_id common_name age_classbylength length_mm IGF1_ng_ml
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 35033 Chinook salmon mixed age juvenile 444 62.1
## 2 35146 Chinook salmon mixed age juvenile 422 101.
## 3 35110 Chinook salmon mixed age juvenile 275 81.5
## 4 35129 Chinook salmon yearling 225 72.7
## 5 35103 Chinook salmon yearling 216 81.2
## 6 35115 Chinook salmon yearling 215 53.5
## 7 35112 Chinook salmon yearling 205 90.5
## 8 35143 Chinook salmon yearling 204 80.9
## 9 35079 Chinook salmon yearling 199 53.2
## 10 35081 Chinook salmon yearling 196 5.56
## # … with 87 more rows
Mutate creates a new variable based on some form of computation
# A new variable is created based on the calculation of the z-score of the variable IGF1_ng_ml using scale()
mutate(df1, scale(IGF1_ng_ml, center = TRUE, scale = TRUE))
## # A tibble: 97 x 6
## salmon_id common_name age_classbyleng… length_mm IGF1_ng_ml `scale(IGF1_ng_…
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 35032 Chinook sal… yearling 147 41.3 -0.258
## 2 35035 Sockeye sal… juvenile 121 NA NA
## 3 35036 Sockeye sal… juvenile 112 NA NA
## 4 35037 Steelhead juvenile 220 42.7 -0.191
## 5 35038 Steelhead juvenile 152 NA NA
## 6 35033 Chinook sal… mixed age juven… 444 62.1 0.704
## 7 35034 Sockeye sal… juvenile 139 NA NA
## 8 35048 Steelhead juvenile 288 24.2 -1.04
## 9 35049 Steelhead juvenile 190 NA NA
## 10 35050 Steelhead juvenile 283 63.5 0.766
## # … with 87 more rows
# A new variable is created called IGFngml_zscore, based on the calculation of the z-score of the variable IGF1_ng_ml using scale()
mutate(df1, IGFngml_zscore = scale(IGF1_ng_ml, center = TRUE, scale = TRUE))
## # A tibble: 97 x 6
## salmon_id common_name age_classbyleng… length_mm IGF1_ng_ml IGFngml_zscore[…
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 35032 Chinook sal… yearling 147 41.3 -0.258
## 2 35035 Sockeye sal… juvenile 121 NA NA
## 3 35036 Sockeye sal… juvenile 112 NA NA
## 4 35037 Steelhead juvenile 220 42.7 -0.191
## 5 35038 Steelhead juvenile 152 NA NA
## 6 35033 Chinook sal… mixed age juven… 444 62.1 0.704
## 7 35034 Sockeye sal… juvenile 139 NA NA
## 8 35048 Steelhead juvenile 288 24.2 -1.04
## 9 35049 Steelhead juvenile 190 NA NA
## 10 35050 Steelhead juvenile 283 63.5 0.766
## # … with 87 more rows
Summarize applies an aggregating or summary function to a group
# First we define the common_name as a group.
df1_byname <- group_by(df1, common_name)
# Summarise is used to count over the grouped common_names
summarise(df1_byname, count = n())
## # A tibble: 4 x 2
## common_name count
## <chr> <int>
## 1 Chinook salmon 46
## 2 Coho salmon 2
## 3 Sockeye salmon 11
## 4 Steelhead 38
# Summarise is used to calculate mean IGF1_ng_ml over the grouped common_names
summarise(df1_byname, IGF1_ng_ml_ave = mean(IGF1_ng_ml, na.rm = T))
## # A tibble: 4 x 2
## common_name IGF1_ng_ml_ave
## <chr> <dbl>
## 1 Chinook salmon 46.8
## 2 Coho salmon 73.6
## 3 Sockeye salmon NaN
## 4 Steelhead 46.1
Grouping can also help ask questions with other functions
# Filter observations with the 2 smallest length_mm for each grouped common_name
filter(df1_byname, rank(length_mm) <= 2)
## # A tibble: 8 x 5
## # Groups: common_name [4]
## salmon_id common_name age_classbylength length_mm IGF1_ng_ml
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 35038 Steelhead juvenile 152 NA
## 2 35055 Steelhead juvenile 123 55.7
## 3 35145 Chinook salmon yearling 130 23.4
## 4 35085 Coho salmon yearling 140 NA
## 5 35087 Coho salmon yearling 164 73.6
## 6 35095 Chinook salmon subyearling 90 NA
## 7 35097 Sockeye salmon juvenile 110 NA
## 8 35099 Sockeye salmon juvenile 111 NA
## # A tibble: 95 x 5
## # Groups: common_name [3]
## salmon_id common_name age_classbylength length_mm IGF1_ng_ml
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 35032 Chinook salmon yearling 147 41.3
## 2 35035 Sockeye salmon juvenile 121 NA
## 3 35036 Sockeye salmon juvenile 112 NA
## 4 35037 Steelhead juvenile 220 42.7
## 5 35038 Steelhead juvenile 152 NA
## 6 35033 Chinook salmon mixed age juvenile 444 62.1
## 7 35034 Sockeye salmon juvenile 139 NA
## 8 35048 Steelhead juvenile 288 24.2
## 9 35049 Steelhead juvenile 190 NA
## 10 35050 Steelhead juvenile 283 63.5
## # … with 85 more rows
# A new variable is created using z-score within the grouped common_names
mutate(df1_byname, IGFngml_zscore = scale(IGF1_ng_ml, center = TRUE, scale = TRUE))
## # A tibble: 97 x 6
## # Groups: common_name [4]
## salmon_id common_name age_classbylength length_mm IGF1_ng_ml IGFngml_zscore
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 35032 Chinook salm… yearling 147 41.3 -0.236
## 2 35035 Sockeye salm… juvenile 121 NA NA
## 3 35036 Sockeye salm… juvenile 112 NA NA
## 4 35037 Steelhead juvenile 220 42.7 -0.176
## 5 35038 Steelhead juvenile 152 NA NA
## 6 35033 Chinook salm… mixed age juveni… 444 62.1 0.652
## 7 35034 Sockeye salm… juvenile 139 NA NA
## 8 35048 Steelhead juvenile 288 24.2 -1.12
## 9 35049 Steelhead juvenile 190 NA NA
## 10 35050 Steelhead juvenile 283 63.5 0.890
## # … with 87 more rows
Piping allows you to pass the result from one expression directly into another.
-> same graphic as before , but extend https://github.com/trinker/tidyr_in_a_nutshell
# Without pipe
df1_byname <- group_by(df1, common_name)
summarise(df1_byname, IGF1_ng_ml_ave = mean(IGF1_ng_ml, na.rm = T))
## # A tibble: 4 x 2
## common_name IGF1_ng_ml_ave
## <chr> <dbl>
## 1 Chinook salmon 46.8
## 2 Coho salmon 73.6
## 3 Sockeye salmon NaN
## 4 Steelhead 46.1
## # A tibble: 4 x 2
## common_name IGF1_ng_ml_ave
## <chr> <dbl>
## 1 Chinook salmon 46.8
## 2 Coho salmon 73.6
## 3 Sockeye salmon NaN
## 4 Steelhead 46.1
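The second table above presumably comes from a piped version of the same summary. A minimal sketch of how the nested and piped styles compare, using a small made-up tibble (`toy`, a hypothetical stand-in for df1 with invented values):

```r
library(dplyr)

# Toy stand-in for df1 (hypothetical values; column names follow the salmon data)
toy <- tibble(
  common_name = c("Chinook salmon", "Chinook salmon", "Steelhead", "Steelhead"),
  IGF1_ng_ml  = c(40, 60, NA, 50)
)

# Nested call: has to be read from the inside out
nested <- summarise(
  group_by(toy, common_name),
  IGF1_ng_ml_ave = mean(IGF1_ng_ml, na.rm = TRUE)
)

# Piped call: the same steps, read from top to bottom
piped <- toy %>%
  group_by(common_name) %>%
  summarise(IGF1_ng_ml_ave = mean(IGF1_ng_ml, na.rm = TRUE))
```

Both forms should give the same per-group means; the pipe only changes how the steps are written, not what they compute.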
# (1) Group by common_name
# (2) Filter to all those that have length bigger than 200
# (3) Summarise is used to calculate mean IGF1_ng_ml over the grouped common_names for these larger fish
group_by(df1, common_name) %>%
  filter(length_mm > 200) %>%
  summarise(IGF1_ng_ml_ave = mean(IGF1_ng_ml, na.rm = T))
## # A tibble: 2 x 2
## common_name IGF1_ng_ml_ave
## <chr> <dbl>
## 1 Chinook salmon 77.9
## 2 Steelhead 45.3
# (1) Create new variable that is discrete label depending on size of the fish
# (2) Group by common_name and size
# (3) Summarise is used to calculate mean IGF1_ng_ml over the grouped common_names and sizes
mutate(df1, size = if_else(length_mm > 200, 'big_fish', 'small_fish')) %>%
  group_by(common_name, size) %>%
  summarise(IGF1_ng_ml_ave = mean(IGF1_ng_ml, na.rm = T))
## # A tibble: 6 x 3
## # Groups: common_name [4]
## common_name size IGF1_ng_ml_ave
## <chr> <chr> <dbl>
## 1 Chinook salmon big_fish 77.9
## 2 Chinook salmon small_fish 39.5
## 3 Coho salmon small_fish 73.6
## 4 Sockeye salmon small_fish NaN
## 5 Steelhead big_fish 45.3
## 6 Steelhead small_fish 47.3
# (1) Create new variable that is discrete label depending on size of the fish
# (2) Group by common_name and size
# (3) Summarise is used to calculate mean IGF1_ng_ml over the grouped common_names and sizes
# (4) Filter out Coho and Sockeye salmon
mutate(df1, size = if_else(length_mm > 200, 'big_fish', 'small_fish')) %>%
  group_by(common_name, size) %>%
  summarise(IGF1_ng_ml_ave = mean(IGF1_ng_ml, na.rm = T)) %>%
  filter(common_name != 'Coho salmon') %>%
  filter(common_name != 'Sockeye salmon')
## # A tibble: 4 x 3
## # Groups: common_name [2]
## common_name size IGF1_ng_ml_ave
## <chr> <chr> <dbl>
## 1 Chinook salmon big_fish 77.9
## 2 Chinook salmon small_fish 39.5
## 3 Steelhead big_fish 45.3
## 4 Steelhead small_fish 47.3
library(plotly)
p <- mutate(df1, size = if_else(length_mm > 200, 'big_fish', 'small_fish')) %>%
  group_by(common_name, size) %>%
  summarise(IGF1_ng_ml_ave = mean(IGF1_ng_ml, na.rm = T)) %>%
  filter(common_name != 'Coho salmon') %>%
  filter(common_name != 'Sockeye salmon') %>%
  ggplot(aes(x = common_name, y = IGF1_ng_ml_ave, group = size, fill = size)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme(axis.text.x = element_text(angle = 90))
ggplotly(p)
CHALLENGE
ANSWER
So we blasted through what being tidy can give you. Now let's tidy some data. The first step is to read in data.
readr:
#Base gives you everything
untidy_counts_base <- read.csv("~/Documents/Box Sync/RU/Teaching/teaching/tidyR/dataset/hemato_rnaseq_counts.csv")
head(untidy_counts_base)
## ENTREZ CD34_1 ORTHO_1 CD34_2 ORTHO_2
## 1 350 204 0 103 0
## 2 351 15586 479 10476 39
## 3 353 842 355 1188 86
## 4 354 0 0 0 0
## 5 355 123 291 139 16
## 6 356 1 1 0 0
#readr gives you a tibble
untidy_counts <- read_csv("~/Documents/Box Sync/RU/Teaching/teaching/tidyR/dataset/hemato_rnaseq_counts.csv")
## Parsed with column specification:
## cols(
## ENTREZ = col_double(),
## CD34_1 = col_double(),
## ORTHO_1 = col_double(),
## CD34_2 = col_double(),
## ORTHO_2 = col_double()
## )
## # A tibble: 100 x 5
## ENTREZ CD34_1 ORTHO_1 CD34_2 ORTHO_2
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 350 204 0 103 0
## 2 351 15586 479 10476 39
## 3 353 842 355 1188 86
## 4 354 0 0 0 0
## 5 355 123 291 139 16
## 6 356 1 1 0 0
## 7 357 380 3 177 0
## 8 358 572 2225 597 4051
## 9 359 0 12 1 0
## 10 360 320 502 46 1114
## # … with 90 more rows
#Tibbles carry and display extra information. While reading in it is easy to specify datatype.
untidy_counts <- read_csv("~/Documents/Box Sync/RU/Teaching/teaching/tidyR/dataset/hemato_rnaseq_counts.csv", col_types = cols(
  ENTREZ = col_character(),
  CD34_1 = col_integer(),
  ORTHO_1 = col_integer(),
  CD34_2 = col_integer(),
  ORTHO_2 = col_integer()
))
untidy_counts
## # A tibble: 100 x 5
## ENTREZ CD34_1 ORTHO_1 CD34_2 ORTHO_2
## <chr> <int> <int> <int> <int>
## 1 350 204 0 103 0
## 2 351 15586 479 10476 39
## 3 353 842 355 1188 86
## 4 354 0 0 0 0
## 5 355 123 291 139 16
## 6 356 1 1 0 0
## 7 357 380 3 177 0
## 8 358 572 2225 597 4051
## 9 359 0 12 1 0
## 10 360 320 502 46 1114
## # … with 90 more rows
## # A tibble: 100 x 1
## ENTREZ
## <chr>
## 1 350
## 2 351
## 3 353
## 4 354
## 5 355
## 6 356
## 7 357
## 8 358
## 9 359
## 10 360
## # … with 90 more rows
## # A tibble: 1 x 5
## ENTREZ CD34_1 ORTHO_1 CD34_2 ORTHO_2
## <chr> <int> <int> <int> <int>
## 1 350 204 0 103 0
#You can also leave out the dimension you pull from. A single index defaults to grabbing the column
untidy_counts[1]
## # A tibble: 100 x 1
## ENTREZ
## <chr>
## 1 350
## 2 351
## 3 353
## 4 354
## 5 355
## 6 356
## 7 357
## 8 358
## 9 359
## 10 360
## # … with 90 more rows
#All the prior outputs have been another tibble. If double brackets are used, a vector is returned
untidy_counts[[1]]
## [1] "350" "351" "353" "354" "355" "356" "357" "358" "359" "360" "361" "362"
## [13] "363" "364" "366" "367" "368" "369" "372" "373" "374" "375" "377" "378"
## [25] "379" "381" "382" "383" "384" "387" "388" "389" "390" "391" "392" "393"
## [37] "394" "395" "396" "397" "398" "399" "400" "401" "402" "403" "405" "406"
## [49] "407" "408" "409" "410" "411" "412" "414" "415" "416" "417" "419" "420"
## [61] "421" "427" "429" "430" "432" "433" "434" "435" "440" "443" "444" "445"
## [73] "460" "462" "463" "466" "467" "468" "471" "472" "473" "474" "475" "476"
## [85] "477" "478" "479" "480" "481" "482" "483" "486" "487" "488" "489" "490"
## [97] "491" "492" "493" "495"
## [1] "350" "351" "353" "354" "355" "356" "357" "358" "359" "360" "361" "362"
## [13] "363" "364" "366" "367" "368" "369" "372" "373" "374" "375" "377" "378"
## [25] "379" "381" "382" "383" "384" "387" "388" "389" "390" "391" "392" "393"
## [37] "394" "395" "396" "397" "398" "399" "400" "401" "402" "403" "405" "406"
## [49] "407" "408" "409" "410" "411" "412" "414" "415" "416" "417" "419" "420"
## [61] "421" "427" "429" "430" "432" "433" "434" "435" "440" "443" "444" "445"
## [73] "460" "462" "463" "466" "467" "468" "471" "472" "473" "474" "475" "476"
## [85] "477" "478" "479" "480" "481" "482" "483" "486" "487" "488" "489" "490"
## [97] "491" "492" "493" "495"
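The subsetting behaviour described above can be summarised in a short sketch. It uses a two-row toy tibble (`toy`, built from the first two ENTREZ rows shown earlier) and a base data frame copy (`toy_base`); both names are introduced here only for illustration.

```r
library(tibble)

# Two-row toy version of the counts table
toy <- tibble(ENTREZ = c("350", "351"), CD34_1 = c(204L, 15586L))
toy_base <- as.data.frame(toy)

single_bracket <- toy[1]       # single index: one-column tibble
double_bracket <- toy[[1]]     # double brackets: plain character vector
dollar         <- toy$ENTREZ   # $ also returns the vector

# With [row, column] indexing a tibble stays a tibble,
# whereas a base data frame drops to a vector by default
tibble_col <- toy[, 1]
base_col   <- toy_base[, 1]
```

This difference in `[, 1]` is one of the more common surprises when passing tibbles to code written for base data frames.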
## # A tibble: 100 x 5
## ENTREZ CD34_1 ORTHO_1 CD34_2 ORTHO_2
## <int> <int> <int> <int> <int>
## 1 350 204 0 103 0
## 2 351 15586 479 10476 39
## 3 353 842 355 1188 86
## 4 354 0 0 0 0
## 5 355 123 291 139 16
## 6 356 1 1 0 0
## 7 357 380 3 177 0
## 8 358 572 2225 597 4051
## 9 359 0 12 1 0
## 10 360 320 502 46 1114
## # … with 90 more rows
#Once it is a tibble it is straightforward to modify the datatype
untidy_counts_base <- as_tibble(untidy_counts_base) %>%
  mutate_at(vars(ENTREZ), as.character)
untidy_counts_base
## # A tibble: 100 x 5
## ENTREZ CD34_1 ORTHO_1 CD34_2 ORTHO_2
## <chr> <int> <int> <int> <int>
## 1 350 204 0 103 0
## 2 351 15586 479 10476 39
## 3 353 842 355 1188 86
## 4 354 0 0 0 0
## 5 355 123 291 139 16
## 6 356 1 1 0 0
## 7 357 380 3 177 0
## 8 358 572 2225 597 4051
## 9 359 0 12 1 0
## 10 360 320 502 46 1114
## # … with 90 more rows
#Some tools are not tibble friendly. Calling as.data.frame is sufficient to convert it back to a base data frame
as.data.frame(untidy_counts_base)
## ENTREZ CD34_1 ORTHO_1 CD34_2 ORTHO_2
## 1 350 204 0 103 0
## 2 351 15586 479 10476 39
## 3 353 842 355 1188 86
## 4 354 0 0 0 0
## 5 355 123 291 139 16
## 6 356 1 1 0 0
## 7 357 380 3 177 0
## 8 358 572 2225 597 4051
## 9 359 0 12 1 0
## 10 360 320 502 46 1114
## 11 361 0 1 0 0
## 12 362 3 1 15 0
## 13 363 14 6 4 1
## 14 364 7 0 1 0
## 15 366 6 0 1 0
## 16 367 42 0 51 1
## 17 368 28 0 24 0
## 18 369 1204 1034 833 478
## 19 372 2829 1864 2741 771
## 20 373 179 728 148 795
## 21 374 76 5 138 2
## 22 375 4428 6697 4970 4328
## 23 377 3170 314 2576 11
## 24 378 1839 4845 1767 2975
## 25 379 178 1 181 0
## 26 381 1617 574 1159 339
## 27 382 2874 2265 1746 1668
## 28 383 63 1632 40 721
## 29 384 148 10977 118 94
## 30 387 8899 2457 7405 1228
## 31 388 12598 171 5090 70
## 32 389 2709 193 2313 5
## 33 390 1004 0 395 0
## 34 391 1038 577 1176 164
## 35 392 1527 304 786 71
## 36 393 2949 138 1540 3
## 37 394 1525 464 1062 134
## 38 395 348 67 123 0
## 39 396 6503 702 4723 169
## 40 397 12997 410 11265 38
## 41 398 0 0 0 0
## 42 399 223 11 422 2
## 43 400 1188 147 806 56
## 44 401 0 0 0 0
## 45 402 504 218 496 80
## 46 403 289 25 166 4
## 47 405 1481 824 1004 812
## 48 406 295 175 87 35
## 49 407 4 1 2 1
## 50 408 2451 111 1523 6
## 51 409 2480 1819 1356 226
## 52 410 433 197 215 77
## 53 411 829 217 441 131
## 54 412 312 45 138 17
## 55 414 516 15 396 9
## 56 415 20 0 13 0
## 57 416 2 1 4 0
## 58 417 0 0 0 0
## 59 419 6 5 2 7
## 60 420 141 1136 94 1217
## 61 421 213 255 93 208
## 62 427 4699 11889 1729 926
## 63 429 0 0 2 0
## 64 430 34 13 22 0
## 65 432 63 0 55 1
## 66 433 38 0 26 0
## 67 434 1 1 2 0
## 68 435 408 151 284 34
## 69 440 157 2520 111 535
## 70 443 5 0 4 0
## 71 444 1583 151 747 14
## 72 445 116 15 90 1
## 73 460 34 0 68 0
## 74 462 70 0 31 1
## 75 463 1244 118 492 71
## 76 466 538 480 393 218
## 77 467 2506 402 2130 18
## 78 468 7991 6132 5307 1883
## 79 471 1272 771 1392 53
## 80 472 1389 628 739 138
## 81 473 4173 783 1901 776
## 82 474 0 0 0 0
## 83 475 284 83 467 34
## 84 476 4952 3453 4202 1416
## 85 477 13 3 26 2
## 86 478 78 121 67 0
## 87 479 17 0 4 0
## 88 480 9 5 11 1
## 89 481 1937 18 1017 34
## 90 482 157 1392 75 1660
## 91 483 1075 1454 1789 1141
## 92 486 47 0 18 0
## 93 487 29 33 19 3
## 94 488 4529 1118 2925 269
## 95 489 3465 153 3188 8
## 96 490 1610 1263 913 665
## 97 491 12 1 4 0
## 98 492 4 0 6 0
## 99 493 5011 3585 3053 743
## 100 495 0 0 0 0
# Let's load in some packages
library(org.Hs.eg.db)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
#Let's use the ENTREZ ID as a key
keys <- untidy_counts$ENTREZ
#We can use the ENTREZ ID to look up the gene symbol
symbols <- select(org.Hs.eg.db, keys = keys, columns = "SYMBOL", keytype = "ENTREZID")
#We can use the ENTREZ ID to look up the chromosome the gene resides on
chrs <- select(TxDb.Hsapiens.UCSC.hg19.knownGene, keys = keys, columns = "TXCHROM", keytype = "GENEID")
#We can use the ENTREZ ID to get a list of genes with the GRanges of their exons
geneExons <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by = "gene")[keys]
#We will then use sapply to get the transcript length for each gene in the list. The transcript length is calculated by first flattening overlapping exons with reduce(), then calculating the length of each exon with width(), then summing up the total exon length to get our transcript length.
txsLength <- sapply(geneExons, function(x){ x %>% reduce() %>% width() %>% sum() })
# Finally we have all this metadata. Let's put it together into a tibble.
counts_metadata <- tibble(ID = symbols$ENTREZID, SYMBOL = symbols$SYMBOL, CHR = chrs$TXCHROM, LENGTH = txsLength)
## # A tibble: 100 x 5
## ENTREZ CD34_1 ORTHO_1 CD34_2 ORTHO_2
## <chr> <int> <int> <int> <int>
## 1 350 204 0 103 0
## 2 351 15586 479 10476 39
## 3 353 842 355 1188 86
## 4 354 0 0 0 0
## 5 355 123 291 139 16
## 6 356 1 1 0 0
## 7 357 380 3 177 0
## 8 358 572 2225 597 4051
## 9 359 0 12 1 0
## 10 360 320 502 46 1114
## # … with 90 more rows
A single variable with multiple columns
#Pivot longer allows you to collapse a single variable that is spread over multiple columns into one column
tidier_counts <- pivot_longer(untidy_counts, cols = c(-ENTREZ), names_to = c("Sample"), values_to = "counts")
#Pivot wider allows you to spread a single variable over multiple columns
pivot_wider(tidier_counts, names_from = c("Sample"), values_from = "counts")
## # A tibble: 100 x 5
## ENTREZ CD34_1 ORTHO_1 CD34_2 ORTHO_2
## <chr> <int> <int> <int> <int>
## 1 350 204 0 103 0
## 2 351 15586 479 10476 39
## 3 353 842 355 1188 86
## 4 354 0 0 0 0
## 5 355 123 291 139 16
## 6 356 1 1 0 0
## 7 357 380 3 177 0
## 8 358 572 2225 597 4051
## 9 359 0 12 1 0
## 10 360 320 502 46 1114
## # … with 90 more rows
## # A tibble: 400 x 3
## ENTREZ Sample counts
## <chr> <chr> <int>
## 1 350 CD34_1 204
## 2 350 ORTHO_1 0
## 3 350 CD34_2 103
## 4 350 ORTHO_2 0
## 5 351 CD34_1 15586
## 6 351 ORTHO_1 479
## 7 351 CD34_2 10476
## 8 351 ORTHO_2 39
## 9 353 CD34_1 842
## 10 353 ORTHO_1 355
## # … with 390 more rows
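To make the round trip easier to follow, here is a minimal self-contained sketch on a two-gene, two-sample subset. The object names (`toy_counts`, `long`, `wide`) are hypothetical; the values are the first two rows of the counts table shown above.

```r
library(tidyr)
library(tibble)

# Small wide table: one column per sample
toy_counts <- tibble(
  ENTREZ  = c("350", "351"),
  CD34_1  = c(204L, 15586L),
  ORTHO_1 = c(0L, 479L)
)

# Wide to long: sample names move into "Sample", values into "counts"
long <- pivot_longer(toy_counts, cols = -ENTREZ,
                     names_to = "Sample", values_to = "counts")

# Long back to wide: should recover the original layout
wide <- pivot_wider(long, names_from = "Sample", values_from = "counts")
```

The long table has one row per gene and sample combination (here 2 genes x 2 samples = 4 rows), which mirrors how the full table grows from 100 to 400 rows.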
Multiple variables in a single column
How do we get tidy? - Cleaning up
#Separate allows you to split the strings in a variable at a separator. In this case the cell type and replicate number are separated by an underscore
tidier_counts <- separate(tidier_counts, Sample, sep = "_", into = c("CellType", "Rep"), remove = TRUE)
tidier_counts
## # A tibble: 400 x 4
## ENTREZ CellType Rep counts
## <chr> <chr> <chr> <int>
## 1 350 CD34 1 204
## 2 350 ORTHO 1 0
## 3 350 CD34 2 103
## 4 350 ORTHO 2 0
## 5 351 CD34 1 15586
## 6 351 ORTHO 1 479
## 7 351 CD34 2 10476
## 8 351 ORTHO 2 39
## 9 353 CD34 1 842
## 10 353 ORTHO 1 355
## # … with 390 more rows
#Unite can go the other way if you want to generate a key. We can use this to make sure we have a key.
unite(tidier_counts, Sample, CellType, Rep, remove = FALSE)
## # A tibble: 400 x 5
## ENTREZ Sample CellType Rep counts
## <chr> <chr> <chr> <chr> <int>
## 1 350 CD34_1 CD34 1 204
## 2 350 ORTHO_1 ORTHO 1 0
## 3 350 CD34_2 CD34 2 103
## 4 350 ORTHO_2 ORTHO 2 0
## 5 351 CD34_1 CD34 1 15586
## 6 351 ORTHO_1 ORTHO 1 479
## 7 351 CD34_2 CD34 2 10476
## 8 351 ORTHO_2 ORTHO 2 39
## 9 353 CD34_1 CD34 1 842
## 10 353 ORTHO_1 ORTHO 1 355
## # … with 390 more rows
#A key difference compared to base is that it does not write out row names. Tibbles generally don't have row names.
#There's a wide range of writing options. You can specify the delimiter directly or use a specific function
write_delim(tidy_counts_expressed, '~/Documents/Box Sync/RU/Teaching/teaching/tidyR/expressed_genes_output.csv', delim = ',')
write_csv(tidy_counts_expressed, '~/Documents/Box Sync/RU/Teaching/teaching/tidyR/expressed_genes_output.csv')
If the data you are working with involves hand-entered text, e.g. clinical study metadata or a hand-typed list of genes of interest, there will often be errors. Tidying data also means fixing these problems. stringr helps make this easy.
Though stringr is pretty comprehensive and covers most of what you will need, there is a sister package called stringi with even more functionality.
stringr has many functions that overlap with base R for combining, subsetting, converting and finding strings
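A rough side-by-side of that overlap, as a sketch rather than a full mapping. The vector `people` holds the same names as the brc example; the "_BRC" suffix is made up for illustration.

```r
library(stringr)

people <- c("Tom", "Ji-Dung", "Matt")

# Combining strings
base_combined <- paste0(people, "_BRC")
tidy_combined <- str_c(people, "_BRC")

# Subsetting by position (1st to 3rd character)
base_sub <- substr(people, 1, 3)
tidy_sub <- str_sub(people, 1, 3)

# Converting case
base_upper <- toupper(people)
tidy_upper <- str_to_upper(people)

# Finding a pattern (a lowercase "t")
base_found <- grepl("t", people)
tidy_found <- str_detect(people, "t")
```

One practical difference is argument order: stringr functions take the string first and the pattern second, which makes them easier to use in a pipe.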
brc <- c("Tom", "Ji-Dung", "Matt")
# Extract substrings from a range. Here the 1st to 3rd character
str_sub(brc, 1, 3)
## [1] "Tom" "Ji-" "Mat"
## [1] "o" "i-Dun" "at"
## [1] "Tom" "Ji -Dung" "Matt"
#Can add whitespace to strings to get consistent length. Here all are 10 characters
str_pad(brc2, width = 10, side = 'left')
## [1] " Tom" " Ji -Dung" " Matt"
## [1] "APOH" "APOH" "APOH" "APOH" "APP" "APP"
## [7] "APP" "APP" "APRT" "APRT" "APRT" "APRT"
## [13] "FAS" "FAS" "FAS" "FAS" "FASLG" "FASLG"
## [19] "FASLG" "FASLG" "SHROOM2" "SHROOM2" "SHROOM2" "SHROOM2"
## [25] "AQP1" "AQP1" "AQP1" "AQP1" "AQP2" "AQP2"
## [31] "AQP2" "AQP2" "AQP3" "AQP3" "AQP3" "AQP3"
## [37] "AQP4" "AQP4" "AQP4" "AQP4" "AQP5" "AQP5"
## [43] "AQP5" "AQP5" "AQP6" "AQP6" "AQP6" "AQP6"
## [49] "AQP7" "AQP7" "AQP7" "AQP7" "AQP9" "AQP9"
## [55] "AQP9" "AQP9" "AR" "AR" "AR" "AR"
## [61] "ABCC6" "ABCC6" "ABCC6" "ABCC6" "ARAF" "ARAF"
## [67] "ARAF" "ARAF" "ARCN1" "ARCN1" "ARCN1" "ARCN1"
## [73] "TRIM23" "TRIM23" "TRIM23" "TRIM23" "AREG" "AREG"
## [79] "AREG" "AREG" "ARF1" "ARF1" "ARF1" "ARF1"
## [85] "ARF3" "ARF3" "ARF3" "ARF3" "ARF4" "ARF4"
## [91] "ARF4" "ARF4" "ARL4D" "ARL4D" "ARL4D" "ARL4D"
## [97] "ARF5" "ARF5" "ARF5" "ARF5" "ARF6" "ARF6"
## [103] "ARF6" "ARF6" "ARG1" "ARG1" "ARG1" "ARG1"
## [109] "ARG2" "ARG2" "ARG2" "ARG2" "RHOA" "RHOA"
## [115] "RHOA" "RHOA" "RHOB" "RHOB" "RHOB" "RHOB"
## [121] "RHOC" "RHOC" "RHOC" "RHOC" "RND3" "RND3"
## [127] "RND3" "RND3" "RHOG" "RHOG" "RHOG" "RHOG"
## [133] "ARHGAP1" "ARHGAP1" "ARHGAP1" "ARHGAP1" "ARHGAP4" "ARHGAP4"
## [139] "ARHGAP4" "ARHGAP4" "ARHGAP5" "ARHGAP5" "ARHGAP5" "ARHGAP5"
## [145] "ARHGAP6" "ARHGAP6" "ARHGAP6" "ARHGAP6" "ARHGDIA" "ARHGDIA"
## [151] "ARHGDIA" "ARHGDIA" "ARHGDIB" "ARHGDIB" "ARHGDIB" "ARHGDIB"
## [157] "RHOH" "RHOH" "RHOH" "RHOH" "ARL1" "ARL1"
## [163] "ARL1" "ARL1" "ARL2" "ARL2" "ARL2" "ARL2"
## [169] "ARL3" "ARL3" "ARL3" "ARL3" "ARNT" "ARNT"
## [175] "ARNT" "ARNT" "ARNTL" "ARNTL" "ARNTL" "ARNTL"
## [181] "ARR3" "ARR3" "ARR3" "ARR3" "ARRB1" "ARRB1"
## [187] "ARRB1" "ARRB1" "ARRB2" "ARRB2" "ARRB2" "ARRB2"
## [193] "ARSA" "ARSA" "ARSA" "ARSA" "ARSB" "ARSB"
## [199] "ARSB" "ARSB" "STS" "STS" "STS" "STS"
## [205] "ARSD" "ARSD" "ARSD" "ARSD" "ARSE" "ARSE"
## [211] "ARSE" "ARSE" "ARSF" "ARSF" "ARSF" "ARSF"
## [217] "ART3" "ART3" "ART3" "ART3" "ART4" "ART4"
## [223] "ART4" "ART4" "ARVCF" "ARVCF" "ARVCF" "ARVCF"
## [229] "ASAH1" "ASAH1" "ASAH1" "ASAH1" "ASCL1" "ASCL1"
## [235] "ASCL1" "ASCL1" "ASCL2" "ASCL2" "ASCL2" "ASCL2"
## [241] "ASGR1" "ASGR1" "ASGR1" "ASGR1" "ASGR2" "ASGR2"
## [247] "ASGR2" "ASGR2" "ASIP" "ASIP" "ASIP" "ASIP"
## [253] "ASL" "ASL" "ASL" "ASL" "ASNS" "ASNS"
## [259] "ASNS" "ASNS" "ASPA" "ASPA" "ASPA" "ASPA"
## [265] "ASPH" "ASPH" "ASPH" "ASPH" "ASS1" "ASS1"
## [271] "ASS1" "ASS1" "ASTN1" "ASTN1" "ASTN1" "ASTN1"
## [277] "SERPINC1" "SERPINC1" "SERPINC1" "SERPINC1" "ZFHX3" "ZFHX3"
## [283] "ZFHX3" "ZFHX3" "ATF1" "ATF1" "ATF1" "ATF1"
## [289] "ATF3" "ATF3" "ATF3" "ATF3" "ATF4" "ATF4"
## [295] "ATF4" "ATF4" "ATIC" "ATIC" "ATIC" "ATIC"
## [301] "ATM" "ATM" "ATM" "ATM" "RERE" "RERE"
## [307] "RERE" "RERE" "ATOX1" "ATOX1" "ATOX1" "ATOX1"
## [313] "ATP1A1" "ATP1A1" "ATP1A1" "ATP1A1" "ATP1A2" "ATP1A2"
## [319] "ATP1A2" "ATP1A2" "ATP1A3" "ATP1A3" "ATP1A3" "ATP1A3"
## [325] "ATP12A" "ATP12A" "ATP12A" "ATP12A" "ATP1A4" "ATP1A4"
## [331] "ATP1A4" "ATP1A4" "ATP1B1" "ATP1B1" "ATP1B1" "ATP1B1"
## [337] "ATP1B2" "ATP1B2" "ATP1B2" "ATP1B2" "ATP1B3" "ATP1B3"
## [343] "ATP1B3" "ATP1B3" "FXYD2" "FXYD2" "FXYD2" "FXYD2"
## [349] "ATP2A1" "ATP2A1" "ATP2A1" "ATP2A1" "ATP2A2" "ATP2A2"
## [355] "ATP2A2" "ATP2A2" "ATP2A3" "ATP2A3" "ATP2A3" "ATP2A3"
## [361] "ATP2B1" "ATP2B1" "ATP2B1" "ATP2B1" "ATP2B2" "ATP2B2"
## [367] "ATP2B2" "ATP2B2" "ATP2B3" "ATP2B3" "ATP2B3" "ATP2B3"
## [373] "ATP2B4" "ATP2B4" "ATP2B4" "ATP2B4"
## [1] "Apoh" "Apoh" "Apoh" "Apoh" "App" "App"
## [7] "App" "App" "Aprt" "Aprt" "Aprt" "Aprt"
## [13] "Fas" "Fas" "Fas" "Fas" "Faslg" "Faslg"
## [19] "Faslg" "Faslg" "Shroom2" "Shroom2" "Shroom2" "Shroom2"
## [25] "Aqp1" "Aqp1" "Aqp1" "Aqp1" "Aqp2" "Aqp2"
## [31] "Aqp2" "Aqp2" "Aqp3" "Aqp3" "Aqp3" "Aqp3"
## [37] "Aqp4" "Aqp4" "Aqp4" "Aqp4" "Aqp5" "Aqp5"
## [43] "Aqp5" "Aqp5" "Aqp6" "Aqp6" "Aqp6" "Aqp6"
## [49] "Aqp7" "Aqp7" "Aqp7" "Aqp7" "Aqp9" "Aqp9"
## [55] "Aqp9" "Aqp9" "Ar" "Ar" "Ar" "Ar"
## [61] "Abcc6" "Abcc6" "Abcc6" "Abcc6" "Araf" "Araf"
## [67] "Araf" "Araf" "Arcn1" "Arcn1" "Arcn1" "Arcn1"
## [73] "Trim23" "Trim23" "Trim23" "Trim23" "Areg" "Areg"
## [79] "Areg" "Areg" "Arf1" "Arf1" "Arf1" "Arf1"
## [85] "Arf3" "Arf3" "Arf3" "Arf3" "Arf4" "Arf4"
## [91] "Arf4" "Arf4" "Arl4d" "Arl4d" "Arl4d" "Arl4d"
## [97] "Arf5" "Arf5" "Arf5" "Arf5" "Arf6" "Arf6"
## [103] "Arf6" "Arf6" "Arg1" "Arg1" "Arg1" "Arg1"
## [109] "Arg2" "Arg2" "Arg2" "Arg2" "Rhoa" "Rhoa"
## [115] "Rhoa" "Rhoa" "Rhob" "Rhob" "Rhob" "Rhob"
## [121] "Rhoc" "Rhoc" "Rhoc" "Rhoc" "Rnd3" "Rnd3"
## [127] "Rnd3" "Rnd3" "Rhog" "Rhog" "Rhog" "Rhog"
## [133] "Arhgap1" "Arhgap1" "Arhgap1" "Arhgap1" "Arhgap4" "Arhgap4"
## [139] "Arhgap4" "Arhgap4" "Arhgap5" "Arhgap5" "Arhgap5" "Arhgap5"
## [145] "Arhgap6" "Arhgap6" "Arhgap6" "Arhgap6" "Arhgdia" "Arhgdia"
## [151] "Arhgdia" "Arhgdia" "Arhgdib" "Arhgdib" "Arhgdib" "Arhgdib"
## [157] "Rhoh" "Rhoh" "Rhoh" "Rhoh" "Arl1" "Arl1"
## [163] "Arl1" "Arl1" "Arl2" "Arl2" "Arl2" "Arl2"
## [169] "Arl3" "Arl3" "Arl3" "Arl3" "Arnt" "Arnt"
## [175] "Arnt" "Arnt" "Arntl" "Arntl" "Arntl" "Arntl"
## [181] "Arr3" "Arr3" "Arr3" "Arr3" "Arrb1" "Arrb1"
## [187] "Arrb1" "Arrb1" "Arrb2" "Arrb2" "Arrb2" "Arrb2"
## [193] "Arsa" "Arsa" "Arsa" "Arsa" "Arsb" "Arsb"
## [199] "Arsb" "Arsb" "Sts" "Sts" "Sts" "Sts"
## [205] "Arsd" "Arsd" "Arsd" "Arsd" "Arse" "Arse"
## [211] "Arse" "Arse" "Arsf" "Arsf" "Arsf" "Arsf"
## [217] "Art3" "Art3" "Art3" "Art3" "Art4" "Art4"
## [223] "Art4" "Art4" "Arvcf" "Arvcf" "Arvcf" "Arvcf"
## [229] "Asah1" "Asah1" "Asah1" "Asah1" "Ascl1" "Ascl1"
## [235] "Ascl1" "Ascl1" "Ascl2" "Ascl2" "Ascl2" "Ascl2"
## [241] "Asgr1" "Asgr1" "Asgr1" "Asgr1" "Asgr2" "Asgr2"
## [247] "Asgr2" "Asgr2" "Asip" "Asip" "Asip" "Asip"
## [253] "Asl" "Asl" "Asl" "Asl" "Asns" "Asns"
## [259] "Asns" "Asns" "Aspa" "Aspa" "Aspa" "Aspa"
## [265] "Asph" "Asph" "Asph" "Asph" "Ass1" "Ass1"
## [271] "Ass1" "Ass1" "Astn1" "Astn1" "Astn1" "Astn1"
## [277] "Serpinc1" "Serpinc1" "Serpinc1" "Serpinc1" "Zfhx3" "Zfhx3"
## [283] "Zfhx3" "Zfhx3" "Atf1" "Atf1" "Atf1" "Atf1"
## [289] "Atf3" "Atf3" "Atf3" "Atf3" "Atf4" "Atf4"
## [295] "Atf4" "Atf4" "Atic" "Atic" "Atic" "Atic"
## [301] "Atm" "Atm" "Atm" "Atm" "Rere" "Rere"
## [307] "Rere" "Rere" "Atox1" "Atox1" "Atox1" "Atox1"
## [313] "Atp1a1" "Atp1a1" "Atp1a1" "Atp1a1" "Atp1a2" "Atp1a2"
## [319] "Atp1a2" "Atp1a2" "Atp1a3" "Atp1a3" "Atp1a3" "Atp1a3"
## [325] "Atp12a" "Atp12a" "Atp12a" "Atp12a" "Atp1a4" "Atp1a4"
## [331] "Atp1a4" "Atp1a4" "Atp1b1" "Atp1b1" "Atp1b1" "Atp1b1"
## [337] "Atp1b2" "Atp1b2" "Atp1b2" "Atp1b2" "Atp1b3" "Atp1b3"
## [343] "Atp1b3" "Atp1b3" "Fxyd2" "Fxyd2" "Fxyd2" "Fxyd2"
## [349] "Atp2a1" "Atp2a1" "Atp2a1" "Atp2a1" "Atp2a2" "Atp2a2"
## [355] "Atp2a2" "Atp2a2" "Atp2a3" "Atp2a3" "Atp2a3" "Atp2a3"
## [361] "Atp2b1" "Atp2b1" "Atp2b1" "Atp2b1" "Atp2b2" "Atp2b2"
## [367] "Atp2b2" "Atp2b2" "Atp2b3" "Atp2b3" "Atp2b3" "Atp2b3"
## [373] "Atp2b4" "Atp2b4" "Atp2b4" "Atp2b4"
## # A tibble: 376 x 11
## # Groups: Sample [4]
## ENTREZ Sample CellType Rep counts count_total CPM SYMBOL CHR LENGTH
## <chr> <chr> <chr> <chr> <int> <int> <dbl> <chr> <chr> <int>
## 1 350 CD34_1 CD34 1 204 307 1.36e3 Apoh chr17 1201
## 2 350 ORTHO… ORTHO 1 0 307 0. Apoh chr17 1201
## 3 350 CD34_2 CD34 2 103 307 9.75e2 Apoh chr17 1201
## 4 350 ORTHO… ORTHO 2 0 307 0. Apoh chr17 1201
## 5 351 CD34_1 CD34 1 15586 26580 1.04e5 App chr21 4480
## 6 351 ORTHO… ORTHO 1 479 26580 5.82e3 App chr21 4480
## 7 351 CD34_2 CD34 2 10476 26580 9.91e4 App chr21 4480
## 8 351 ORTHO… ORTHO 2 39 26580 1.19e3 App chr21 4480
## 9 353 CD34_1 CD34 1 842 2471 5.62e3 Aprt chr16 807
## 10 353 ORTHO… ORTHO 1 355 2471 4.32e3 Aprt chr16 807
## # … with 366 more rows, and 1 more variable: TPM <dbl>
## # A tibble: 376 x 11
## # Groups: Sample [4]
## ENTREZ Sample CellType Rep counts count_total CPM SYMBOL CHR LENGTH
## <chr> <chr> <chr> <chr> <int> <int> <dbl> <chr> <chr> <int>
## 1 350 CD34_1 CD34 1 204 307 1.36e3 APOH CHR17 1201
## 2 350 ORTHO… ORTHO 1 0 307 0. APOH CHR17 1201
## 3 350 CD34_2 CD34 2 103 307 9.75e2 APOH CHR17 1201
## 4 350 ORTHO… ORTHO 2 0 307 0. APOH CHR17 1201
## 5 351 CD34_1 CD34 1 15586 26580 1.04e5 APP CHR21 4480
## 6 351 ORTHO… ORTHO 1 479 26580 5.82e3 APP CHR21 4480
## 7 351 CD34_2 CD34 2 10476 26580 9.91e4 APP CHR21 4480
## 8 351 ORTHO… ORTHO 2 39 26580 1.19e3 APP CHR21 4480
## 9 353 CD34_1 CD34 1 842 2471 5.62e3 APRT CHR16 807
## 10 353 ORTHO… ORTHO 1 355 2471 4.32e3 APRT CHR16 807
## # … with 366 more rows, and 1 more variable: TPM <dbl>
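The two tibbles above differ only in letter case: the first shows title-case `SYMBOL` values, the second shows upper-case `SYMBOL` and `CHR` values. The calls that produced them are not shown, so the following is only a minimal sketch of how such case changes can be made with `str_to_title()` and `str_to_upper()` inside `mutate()`. The `genes` tibble is a hypothetical two-row stand-in, and `across()` assumes dplyr 1.0 or later.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the expression table (not the original data object)
genes <- tibble::tibble(
  SYMBOL = c("APOH", "APP"),
  CHR    = c("chr17", "chr21")
)

# Title case on one column: first letter upper case, the rest lower case
titled <- genes %>% mutate(SYMBOL = str_to_title(SYMBOL))

# Upper case on several character columns at once
upper <- genes %>% mutate(across(c(SYMBOL, CHR), str_to_upper))
```

Applying the function inside `mutate()` keeps the result in the data frame, whereas `pull()` followed by a stringr call returns a bare character vector.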
#Find patterns in different ways
#Detect returns TRUE/FALSE for each element of the vector, depending on whether the pattern 'salmon' is present
df1 %>% dplyr::pull(common_name) %>% str_detect('salmon')
## [1] TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [49] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [61] TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE
## [73] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE
## [85] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [97] TRUE
#Subset returns only the elements of the vector that contain the pattern 'salmon'
df1 %>% dplyr::pull(common_name) %>% str_subset('salmon')
## [1] "Chinook salmon" "Sockeye salmon" "Sockeye salmon" "Chinook salmon"
## [5] "Sockeye salmon" "Chinook salmon" "Chinook salmon" "Sockeye salmon"
## [9] "Chinook salmon" "Chinook salmon" "Sockeye salmon" "Chinook salmon"
## [13] "Chinook salmon" "Chinook salmon" "Chinook salmon" "Chinook salmon"
## [17] "Chinook salmon" "Chinook salmon" "Chinook salmon" "Coho salmon"
## [21] "Coho salmon" "Chinook salmon" "Chinook salmon" "Chinook salmon"
## [25] "Chinook salmon" "Chinook salmon" "Chinook salmon" "Chinook salmon"
## [29] "Sockeye salmon" "Sockeye salmon" "Sockeye salmon" "Sockeye salmon"
## [33] "Sockeye salmon" "Chinook salmon" "Chinook salmon" "Chinook salmon"
## [37] "Chinook salmon" "Chinook salmon" "Chinook salmon" "Chinook salmon"
## [41] "Chinook salmon" "Chinook salmon" "Chinook salmon" "Sockeye salmon"
## [45] "Chinook salmon" "Chinook salmon" "Chinook salmon" "Chinook salmon"
## [49] "Chinook salmon" "Chinook salmon" "Chinook salmon" "Chinook salmon"
## [53] "Chinook salmon" "Chinook salmon" "Chinook salmon" "Chinook salmon"
## [57] "Chinook salmon" "Chinook salmon" "Chinook salmon"
#Ends is similar to detect: it returns TRUE/FALSE for each element, but the pattern 'salmon' has to be at the end of the string
df1 %>% dplyr::pull(common_name) %>% str_ends('salmon')
## [1] TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [49] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [61] TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE
## [73] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE
## [85] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [97] TRUE
## # A tibble: 59 x 5
## salmon_id common_name age_classbylength length_mm IGF1_ng_ml
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 35032 Chinook salmon yearling 147 41.3
## 2 35035 Sockeye salmon juvenile 121 NA
## 3 35036 Sockeye salmon juvenile 112 NA
## 4 35033 Chinook salmon mixed age juvenile 444 62.1
## 5 35034 Sockeye salmon juvenile 139 NA
## 6 35142 Chinook salmon yearling 149 66.5
## 7 35143 Chinook salmon yearling 204 80.9
## 8 35144 Sockeye salmon juvenile 140 NA
## 9 35145 Chinook salmon yearling 130 23.4
## 10 35146 Chinook salmon mixed age juvenile 422 101.
## # … with 49 more rows
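The 59-row tibble above looks like `df1` restricted to rows whose `common_name` contains 'salmon', though the call that produced it is not shown. A minimal sketch of that idea, assuming `str_detect()` is used as the condition inside `dplyr::filter()`; the `toy` tibble is a hypothetical stand-in built from a few rows printed earlier in this document.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for df1, using values from rows shown earlier
toy <- tibble::tibble(
  salmon_id   = c(35032, 35035, 35037),
  common_name = c("Chinook salmon", "Sockeye salmon", "Steelhead"),
  length_mm   = c(147, 121, 220)
)

# str_detect() returns one TRUE/FALSE per row, which filter() uses to keep rows
salmon_only <- toy %>% dplyr::filter(str_detect(common_name, "salmon"))
```

Because `str_detect()` returns a logical vector the same length as the column, it can be used anywhere a row-wise condition is expected.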
#Count gives you the number of times your pattern appears in each character string in the vector
df1 %>% dplyr::pull(common_name) %>% str_count('salmon')
## [1] 1 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0
## [39] 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1
## [77] 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [1] 3 2 2 0 0 3 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 3 2 3 3 2 0 0
## [39] 0 0 0 0 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 0 3 0 0 3 3 3 3 3 3 3 3
## [77] 3 2 0 0 0 0 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
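The second set of counts above (3 for "Chinook salmon", 2 for "Sockeye salmon", 0 for "Steelhead") is consistent with counting a single letter such as 'o', but the original call is not shown, so treat that pattern as an assumption. A small sketch contrasting a whole-word pattern with a single-character pattern, using a hypothetical vector of names:

```r
library(stringr)

# Hypothetical vector of names taken from values printed in this document
fish <- c("Chinook salmon", "Sockeye salmon", "Coho salmon", "Steelhead")

# Whole-word pattern: at most one match per name here
word_counts <- str_count(fish, "salmon")

# Single-letter pattern (assumed): several matches per name are possible
letter_counts <- str_count(fish, "o")
```

Unlike `str_detect()`, which stops at yes or no, `str_count()` reports how many non-overlapping matches each string contains.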
## [1] "Chinook salmon" "Sockeye salmon" "Sockeye salmon" "Steelhead trout"
## [5] "Steelhead trout" "Chinook salmon" "Sockeye salmon" "Steelhead trout"
## [9] "Steelhead trout" "Steelhead trout" "Steelhead trout" "Steelhead trout"
## [13] "Steelhead trout" "Steelhead trout" "Steelhead trout" "Steelhead trout"
## [17] "Steelhead trout" "Steelhead trout" "Steelhead trout" "Steelhead trout"
## [21] "Steelhead trout" "Steelhead trout" "Steelhead trout" "Steelhead trout"
## [25] "Steelhead trout" "Steelhead trout" "Steelhead trout" "Steelhead trout"
## [29] "Steelhead trout" "Steelhead trout" "Chinook salmon" "Chinook salmon"
## [33] "Sockeye salmon" "Chinook salmon" "Chinook salmon" "Sockeye salmon"
## [37] "Steelhead trout" "Steelhead trout" "Steelhead trout" "Steelhead trout"
## [41] "Steelhead trout" "Steelhead trout" "Chinook salmon" "Chinook salmon"
## [45] "Chinook salmon" "Chinook salmon" "Chinook salmon" "Chinook salmon"
## [49] "Chinook salmon" "Chinook salmon" "Coho salmon" "Coho salmon"
## [53] "Chinook salmon" "Chinook salmon" "Chinook salmon" "Chinook salmon"
## [57] "Chinook salmon" "Chinook salmon" "Chinook salmon" "Sockeye salmon"
## [61] "Sockeye salmon" "Sockeye salmon" "Sockeye salmon" "Sockeye salmon"
## [65] "Steelhead trout" "Chinook salmon" "Steelhead trout" "Steelhead trout"
## [69] "Chinook salmon" "Chinook salmon" "Chinook salmon" "Chinook salmon"
## [73] "Chinook salmon" "Chinook salmon" "Chinook salmon" "Chinook salmon"
## [77] "Chinook salmon" "Sockeye salmon" "Steelhead trout" "Steelhead trout"
## [81] "Steelhead trout" "Steelhead trout" "Chinook salmon" "Chinook salmon"
## [85] "Chinook salmon" "Chinook salmon" "Chinook salmon" "Chinook salmon"
## [89] "Chinook salmon" "Chinook salmon" "Chinook salmon" "Chinook salmon"
## [93] "Chinook salmon" "Chinook salmon" "Chinook salmon" "Chinook salmon"
## [97] "Chinook salmon"
## # A tibble: 97 x 5
## salmon_id common_name age_classbylength length_mm IGF1_ng_ml
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 35032 Chinook salmon yearling 147 41.3
## 2 35035 Sockeye salmon juvenile 121 NA
## 3 35036 Sockeye salmon juvenile 112 NA
## 4 35037 Steelhead trout juvenile 220 42.7
## 5 35038 Steelhead trout juvenile 152 NA
## 6 35033 Chinook salmon mixed age juvenile 444 62.1
## 7 35034 Sockeye salmon juvenile 139 NA
## 8 35048 Steelhead trout juvenile 288 24.2
## 9 35049 Steelhead trout juvenile 190 NA
## 10 35050 Steelhead trout juvenile 283 63.5
## # … with 87 more rows
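In the output above, "Steelhead" now reads "Steelhead trout". The call that made the change is not shown; one way to get this result is `str_replace()` inside `mutate()`, sketched below on a hypothetical `toy` tibble. The anchored pattern `^Steelhead$` is a choice made for this sketch, not something taken from the original.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for df1
toy <- tibble::tibble(
  salmon_id   = c(35032, 35037, 35038),
  common_name = c("Chinook salmon", "Steelhead", "Steelhead")
)

# Replace the whole value "Steelhead" with "Steelhead trout"
renamed <- toy %>%
  mutate(common_name = str_replace(common_name, "^Steelhead$", "Steelhead trout"))

# Running the same replacement again leaves the column unchanged
renamed_twice <- renamed %>%
  mutate(common_name = str_replace(common_name, "^Steelhead$", "Steelhead trout"))
```

Anchoring with `^` and `$` means an already corrected value no longer matches, so re-running the chunk does not produce "Steelhead trout trout".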
applying functions to datasets
#forcats
Hadley Wickham (Chief Scientist at RStudio) is the driving force behind the tidyverse.
Hadley wrote a paper about why he thinks tidy data is best: www.jstatsoft.org/v59/i10/paper.
There is a lot of support for all things tidy at: https://www.tidyverse.org/
readxl: This package is very useful when you want to import Excel sheets into R
googledrive: Interact with your Google Drive through R
lubridate and hms: Allow managing of calendar and time formats
magrittr: Provides the %>% pipe used in the code above
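The `%>%` pipe from magrittr, used in the calls earlier in this document, passes the value on its left as the first argument of the function on its right. A minimal sketch comparing a nested call with the piped equivalent; the `fish` vector is hypothetical.

```r
library(magrittr)
library(stringr)

# Hypothetical vector of names
fish <- c("Chinook salmon", "Steelhead", "Coho salmon")

# Nested: has to be read from the inside out
nested <- sum(str_detect(fish, "salmon"))

# Piped: reads left to right, one step at a time
piped <- fish %>% str_detect("salmon") %>% sum()
```

Both forms give the same answer; the piped version avoids the nested functions that make longer chains hard to follow.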
GGplot here
tidy workbook